10. Analyzing Dataset for High Cardinality

High Cardinality

Cardinality: the number of unique values that a feature has. It is relevant to EHR datasets because code sets such as diagnosis codes can contain on the order of tens of thousands of unique codes. Cardinality applies only to categorical features. High cardinality is a problem because it can increase dimensionality (for example, one-hot encoding creates one column per unique value), which makes training models much more difficult and time-consuming.
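To make the dimensionality point concrete, here is a minimal sketch using pandas, assuming one-hot encoding as the encoding step. The DataFrame, its column names, and the sample codes are hypothetical and only for illustration.

```python
import pandas as pd

# Hypothetical toy data: column names and values are made up for illustration.
df = pd.DataFrame({
    "gender": ["F", "M", "F", "M", "F", "M"],
    "principal_diagnosis_code": ["E11.9", "I10", "J45.909", "N18.3", "K21.9", "M54.5"],
})

# One-hot encoding creates one indicator column per unique value,
# so the number of new columns equals the feature's cardinality.
gender_encoded = pd.get_dummies(df["gender"])
diagnosis_encoded = pd.get_dummies(df["principal_diagnosis_code"])

print("gender columns:", gender_encoded.shape[1])
print("diagnosis columns:", diagnosis_encoded.shape[1])
```

In this toy example the low-cardinality field adds 2 columns while the diagnosis field adds one column per distinct code. With tens of thousands of codes, the encoded feature space would grow accordingly.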

How do we define a field with high cardinality?

  • Determine if it is a categorical feature.
  • Determine if it has a high number of unique values. This is somewhat subjective, but a field with 2 unique values clearly does not have high cardinality, whereas a field like diagnosis codes, which might have tens of thousands of unique values, does.
  • Use the pandas nunique() method to return the number of unique values for the categorical features identified above.
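The steps above can be sketched roughly as follows. This is an illustrative example, not the course's own code: the DataFrame, the column names, the helper count_unique_values, and the threshold are all assumptions chosen for a tiny toy dataset.

```python
import pandas as pd

# Hypothetical toy EHR-like data; names and values are made up for illustration.
df = pd.DataFrame({
    "gender": ["F", "M", "F", "M", "F", "M"],
    "zip_code": ["94105", "10001", "60601", "73301", "98101", "30301"],
    "principal_diagnosis_code": ["E11.9", "I10", "J45.909", "E11.9", "K21.9", "M54.5"],
    "age": [34, 51, 29, 67, 45, 72],
})

# Step 1: decide which fields are categorical (assumed here by inspection;
# "age" is numerical, so it is left out).
categorical_cols = ["gender", "zip_code", "principal_diagnosis_code"]


def count_unique_values(frame, columns):
    """Return the number of unique values per column (hypothetical helper)."""
    # nunique() ignores missing values by default (dropna=True).
    return frame[columns].nunique()


# Steps 2 and 3: count unique values and compare against a cutoff.
cardinality = count_unique_values(df, categorical_cols)

# Arbitrary cutoff for this 6-row toy dataset; a real threshold is a judgment call.
HIGH_CARDINALITY_THRESHOLD = 4
high_cardinality_cols = cardinality[cardinality > HIGH_CARDINALITY_THRESHOLD].index.tolist()

print(cardinality)
print(high_cardinality_cols)
```

Because what counts as "high" is subjective, the threshold is kept as an explicit constant so it can be adjusted to the dataset at hand.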

Additional Resources

Reducing Dimensionality

High Cardinality Quiz

Which of the following might be considered to have high cardinality?

SOLUTION:
  • Principal diagnosis code
  • Zip Code

Code

If you need the code, see https://github.com/udacity.